Abstract

Degrading performance of indexing schemes for exact similarity search in high dimensions has long since been 
linked to histograms of distributions of distances and other 1-Lipschitz functions getting concentrated. We 
discuss this observation in the framework of the phenomenon of concentration of measure on the structures 
of high dimension and the Vapnik-Chervonenkis theory of statistical learning. 
1. Introduction

At an intuitive level, at least for a limited class of indexing schemes the geometric and probabilistic origin of the curse of dimensionality is quite transparent. Let $W = (\Omega, \rho, X)$ denote a similarity workload, where $\rho$ is a metric on a domain $\Omega$ and $X$ is a finite subset of $\Omega$ (dataset). Let us say we are interested in indexing into $W$ for deterministic, exact range queries. A traditional "distance-based" indexing scheme, stripped down to the bone, consists of a family of real-valued functions $f_i$, $i \in I$, on $\Omega$, either fully or partially defined, which satisfy the 1-Lipschitz property:

$|f_i(x) - f_i(y)| \le \rho(x, y)$. (1)

(For example, a pivot-based indexing scheme will be using distance functions $\rho(p_i, -)$ to the pivots $p_i \in \Omega$.) Given a range query $(q, \varepsilon)$, where $q \in \Omega$ and $\varepsilon > 0$, the algorithm chooses recursively a sequence of indices $i_n$, where each $i_{n+1}$ is determined by the values $f_{i_k}(q)$, $k \le n$. The functions $f_i$ serve to discard those datapoints which cannot possibly answer the query. Namely, if $|f_i(q) - f_i(x)| > \varepsilon$, then, by the 1-Lipschitz property of $f_i$, one has $\rho(q, x) > \varepsilon$, and so the point $x$ is irrelevant and need not be considered (Figure 1).




Figure 1: The datapoint x can be discarded. 

After the calculation terminates, the algorithm returns all points which cannot be discarded, and checks each one of them against the condition $\rho(x, q) \le \varepsilon$.
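
The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain is taken to be Euclidean space, the 1-Lipschitz functions are distances to pivots drawn from the dataset, and the names `range_query` and `pivot_table` are hypothetical, not taken from any indexing library.

```python
import math
import random

def euclidean(x, y):
    """Metric rho on the domain; here (an assumption) the Euclidean distance."""
    return math.dist(x, y)

def range_query(dataset, pivots, pivot_table, q, eps, rho=euclidean):
    """Return all x in dataset with rho(q, x) <= eps, using pivot filtering."""
    q_to_pivots = [rho(q, p) for p in pivots]  # k distance computations
    hits, checked = [], 0
    for x, row in zip(dataset, pivot_table):
        # Discard x if some 1-Lipschitz function f_i = rho(p_i, -) separates
        # x from q by more than eps: then rho(q, x) > eps necessarily.
        if any(abs(fq - fx) > eps for fq, fx in zip(q_to_pivots, row)):
            continue
        checked += 1  # x survived the filter and is verified directly
        if rho(q, x) <= eps:
            hits.append(x)
    return hits, checked

random.seed(0)
d, n, k = 3, 500, 4
dataset = [tuple(random.random() for _ in range(d)) for _ in range(n)]
pivots = random.sample(dataset, k)
# Precomputed values f_i(x) for every datapoint x and every pivot p_i.
pivot_table = [[euclidean(x, p) for p in pivots] for x in dataset]

q = tuple(random.random() for _ in range(d))
hits, checked = range_query(dataset, pivots, pivot_table, q, eps=0.2)
brute = [x for x in dataset if euclidean(q, x) <= 0.2]
print(len(hits), "hits;", checked, "of", n, "points needed a direct check")
```

The count `checked` is the quantity that grows towards `n` as the dimension increases, which is the effect discussed in the rest of this section.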

Next come two standard observations about high-dimensional data. The first one, known as the "empty space paradox," asserts that the average distance $\mathbb{E}(\varepsilon_{NN})$ to the nearest neighbour approaches the average distance $\mathbb{E}(\rho)$ between two datapoints as the dimension $d$ goes to infinity, provided the number of datapoints, $n$, grows subexponentially in $d$. Cf. Figure 2, where we illustrate the point with a constant number of points ($n = 10^3$ and $n = 10^5$), and the distances are normalized so that the characteristic size of the gaussian space $(\mathbb{R}^d, \gamma^d)$,

$\mathrm{CharSize}(\Omega) = \mathbb{E}_{\mu \otimes \mu}(\rho(x, y))$, (2)

is one.
Figure 2: The normalized average distance to the nearest neighbour in a dataset of $n$ points randomly drawn from a gaussian distribution in $\mathbb{R}^d$.
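
The effect shown in Figure 2 can be reproduced on a small scale. The sketch below makes two simplifying assumptions that are not in the text: the characteristic size is replaced by the sample mean of pairwise distances, and the nearest neighbour of each datapoint is sought among the remaining datapoints rather than for independent query points. The function name `normalized_nn_distance` is hypothetical.

```python
import math
import random

def normalized_nn_distance(d, n, rng):
    """Mean nearest-neighbour distance divided by mean pairwise distance."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    nn_total, pair_total, pairs = 0.0, 0.0, 0
    for i, x in enumerate(pts):
        dists = [math.dist(x, y) for j, y in enumerate(pts) if j != i]
        nn_total += min(dists)      # distance from x to its nearest neighbour
        pair_total += sum(dists)    # contributes to the average distance
        pairs += len(dists)
    return (nn_total / n) / (pair_total / pairs)

rng = random.Random(1)
ratios = {d: normalized_nn_distance(d, 100, rng) for d in (2, 20, 200)}
for d, r in ratios.items():
    print(f"d = {d:3d}: E(eps_NN) / CharSize ~ {r:.2f}")
```

With the number of points held fixed, the printed ratio moves towards one as the dimension grows.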

The second observation is that the histograms of values of common 1-Lipschitz functions on high-dimensional data are concentrated near their mean (or median) values. This effect is already pronounced in moderate dimensions such as d = 14 in Figure 3. Here the function is a distance to a randomly chosen pivot $p$, and assuming the query point $q$ is at a distance $\approx 1$ from $p$, only the points outside of the region marked by vertical bars can be discarded.

The two properties combined imply that as $d \to \infty$, fewer and fewer datapoints can be discarded for
an average range query, and the performance of an 
indexing scheme degrades rapidly. This mechanism 
has been discussed repeatedly, e.g. Q, pp. 35-37, 
371 ] . 471 ] . p. 487, to mention just a few sources. 

Figure 3: Histogram of distances to a randomly chosen pivot in a dataset $X$ of n = 10 points drawn from a gaussian distribution in $\mathbb{R}^{14}$. The vertical lines mark the mean normalized distance $1 \pm \varepsilon_{NN}$.

To make this idea yield rigorous lower performance bounds, one needs to guarantee first that every histogram of values of 1-Lipschitz functions used to build an indexing scheme for a given domain is highly concentrated. In other words, if $\mathscr{F}$ denotes a class of 1-Lipschitz functions from which we can choose the $f_i$, then we want a low uniform upper bound on the variances of $f \in \mathscr{F}$. Results of this type are indeed well-known for a variety of geometric objects and are referred to jointly as the phenomenon of concentration of measure 2l|, 3^, 13].

The next problem is how to link the concentration of functions $f$ with regard to the presumed underlying distribution $\mu$ on the domain $\Omega$ to concentration with regard to the empirical measure $\mu_n$ supported on the dataset $X$ (this was essentially a criticism of 14211 made in 481]]). Here one needs the machinery of statistical learning theory of Vapnik and Chervonenkis 5^, 2|, flH l^tJ , which can guarantee such results provided the class $\mathscr{F}$ has low combinatorial complexity (e.g., a finite VC dimension). This way, one obtains $\Omega(n / d \lg n)$ lower bounds for the pivot table expected average performance [58], as well as superpolynomial in $d$ lower bounds for metric trees

m 

Approximate NN queries [24], [40] seem to be in some sense free from the curse of dimensionality. In fact, the concentration of measure becomes a positive force here, and we will try to explain why, using the example of random projections in the Hamming cube (the approach of Kushilevitz, Ostrovsky and Rabani [1^]), as well as the Euclidean space (the Johnson-Lindenstrauss lemma [25]).

Getting back to exact search, the Curse of Dimensionality Conjecture [23] calls for a general statement about lower bounds, which would apply across the entire range of all possible indexing schemes. The conjecture is still open even for the Hamming cube $\{0,1\}^d$, and we discuss it briefly.

We conclude the article with a few remarks on the notion of intrinsic dimensionality of data, on a black-box search model of Krauthgamer and Lee [28], as well as on a spatial approximation algorithm based on Delaunay graphs [37].

Figure 4: To the concept of the concentration function $\alpha_\Omega(\varepsilon)$.



2. Concentration 

2.1. The concentration of measure phenomenon 

Informally, the phenomenon can be stated as follows:

On a typical "high-dimensional" structure $\Omega$, every 1-Lipschitz function $f\colon \Omega \to [0, 1]$ has small variation.

Usually, however, concentration is dealt with using a different dispersion parameter. We proceed to precise definitions.

Let a metric space $(\Omega, \rho)$ carry a probability measure $\mu$. Such an object is called a metric space with measure. One defines the concentration function $\alpha_\Omega$ of $\Omega$ by setting $\alpha_\Omega(0) = 1/2$ and, for $\varepsilon > 0$,

$\alpha_\Omega(\varepsilon) = 1 - \inf\{\mu(A_\varepsilon) : A \subseteq \Omega,\ \mu(A) \ge 1/2\}$.

The value $\alpha_\Omega(\varepsilon)$ gives a uniform upper bound on the measure of the complement to the $\varepsilon$-neighbourhood $A_\varepsilon$ of every subset $A$ of measure $\ge 1/2$, cf. Fig. 4.

On a typical high-dimensional geometric object $\Omega$ the function $\alpha(\varepsilon)$ drops off steeply near zero. For regular geometric objects such as Hamming cubes, Euclidean unit spheres and so on, one can usually derive gaussian upper bounds of the form

$\alpha(\varepsilon) \le \exp(-\Theta(\varepsilon^2 d))$,

where $d$ is the dimension parameter.

For example, the Hamming cube $\{0,1\}^d$ with the normalized Hamming metric and uniform measure satisfies a Chernoff bound $\alpha(\varepsilon) \le \exp(-2\varepsilon^2 d)$ (obtained by combining Harper's isoperimetric inequality, see e.g. [13], with the classical Chernoff bound, cf. [13, 2.2.1]). See Figure 5.

Figure 5: Concentration function of the Hamming cube of dimension d = 100 vs Chernoff bound.
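
A comparison in the spirit of Figure 5 can be computed exactly with integer arithmetic. The sketch below assumes that the extremal sets in the Hamming cube are Hamming balls (Harper's inequality) and therefore uses the binomial tail beyond weight $d/2 + \varepsilon d$ as a stand-in for the concentration function; whether this tail coincides with $\alpha(\varepsilon)$ exactly depends on conventions about neighbourhoods that the text does not fix. The helper name `hamming_tail` is hypothetical.

```python
import math

def hamming_tail(d, k):
    """Uniform measure of {x in {0,1}^d : weight(x) >= d/2 + k}, for even d."""
    threshold = d // 2 + k
    return sum(math.comb(d, w) for w in range(threshold, d + 1)) / 2 ** d

d = 100
rows = []
for k in (5, 10, 15, 20):
    eps = k / d                          # radius in the normalized Hamming metric
    tail = hamming_tail(d, k)            # proxy for alpha(eps)
    bound = math.exp(-2 * eps ** 2 * d)  # the Chernoff bound quoted in the text
    rows.append((eps, tail, bound))
    print(f"eps = {eps:.2f}: tail = {tail:.3e}, exp(-2 eps^2 d) = {bound:.3e}")
```

In each row the tail stays below the gaussian bound, and both quantities fall off quickly as the radius grows.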

It follows easily that for every real-valued 1-Lipschitz function $f$ on $\Omega$ and for each $\varepsilon > 0$ one has

$\mu\{x \in \Omega : |f(x) - M_f| > \varepsilon\} \le 2\alpha_\Omega(\varepsilon)$, (3)

where $M_f$ is the median value of $f$, that is, a (generally non-unique) real number with the property that for a randomly drawn $x \in \Omega$ the probabilities of the events $[f(x) \ge M_f]$ and $[f(x) \le M_f]$ are at least 1/2 each. One can further derive uniform upper bounds in terms of $\alpha_\Omega$ on the variances of 1-Lipschitz functions on $\Omega$ with values in a bounded interval.
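
One way to obtain the inequality (3) is sketched below. The sketch assumes that $A_\varepsilon$ denotes the set of points at distance at most $\varepsilon$ from $A$; the argument is the same for open neighbourhoods.

```latex
\begin{align*}
A &= \{x \in \Omega : f(x) \le M_f\}, \qquad \mu(A) \ge \tfrac12, \\
x \in A_\varepsilon
  &\;\Longrightarrow\; f(x) \le \inf_{a \in A}\bigl(f(a) + \rho(x, a)\bigr)
   \le M_f + \rho(x, A) \le M_f + \varepsilon, \\
\mu\{x : f(x) > M_f + \varepsilon\}
  &\le 1 - \mu(A_\varepsilon) \le \alpha_\Omega(\varepsilon).
\end{align*}
```

Applying the same reasoning to the set $\{x : f(x) \ge M_f\}$ bounds the measure of $\{x : f(x) < M_f - \varepsilon\}$ by $\alpha_\Omega(\varepsilon)$ as well, and adding the two estimates gives the factor 2.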

The concentration phenomenon admits the following illustration. Draw 1,000 points randomly from a high-dimensional geometric object such as the unit cube $\mathbb{I}^d$ centred at the origin, choose a random orthogonal projection onto a two-dimensional subspace, and project both the cube and the chosen points on this subspace. The points will concentrate near the centre, the more so the higher the dimension $d$ is, as seen in Figure 6 for the values of dimension d = 3, 10, 100 and 1000. The red outline is the two-dimensional projection of the cube.

Figure 6: Orthogonal projection of a unit Euclidean $d$-cube and of 1,000 random points inside the cube on a random 2-subspace, d = 3 (top left), d = 10 (top right), d = 100 (bottom left), d = 1000 (bottom right).

Another noteworthy consequence of concentration is that the shape of the random projection of the cube is getting ever more similar to a disk as $d \to \infty$. In fact, for d = 1000 the only visual difference one can spot between a random projection of a unit cube and that of a unit sphere is the scale of the two projections: the diameter of a unit $d$-cube is $\Theta(\sqrt{d})$.

This illustrates an interesting feature of geometry of high dimensions: many high-dimensional objects look essentially the same to a low-dimensional observer. For instance, a certain precise version of this statement holds true for all convex bodies, as recently proved by Klartag [27]. For this reason, for an asymptotic study of performance of indexing schemes when $d \to \infty$ the choice of a particular family of domains (Euclidean spheres, balls, cubes, Hamming cubes) does not matter that much.

Among the books treating the concentration phenomenon, [35] is the most reader-friendly, [30] the most comprehensive, and [20] contains a wealth of ideas. See also a survey [34].

2.2. Asymptotic assumptions on the similarity workload

Let us agree on the following four assumptions on the similarity workload:

2.2.1. Domain as a metric space with measure

The metric domain $(\Omega, \rho)$ is equipped with a probability measure $\mu$, and datapoints are drawn from $\Omega$ in an i.i.d. fashion following the distribution $\mu$.

(This is the model used in which of course agrees with the traditional statistical approach to data modelling.)

2.2.2. Normalization of the distance

The distance $\rho$ on the domain is normalized so that the characteristic size of $\Omega$ is constant:

$\mathrm{CharSize}(\Omega) = \mathbb{E}_{\mu \otimes \mu}(\rho) = \Theta(1)$.

(Every domain can be renormalized in the above fashion unless the expected distance between the two points is infinite, which does not appear to be a realistic assumption anyway.)



2.2.3. Growing intrinsic dimension

$\Omega$ has "intrinsic dimension $d$" in the sense that the concentration function of the metric space with measure $(\Omega, \rho, \mu)$ admits a gaussian upper bound

$\alpha_\Omega(\varepsilon) = \exp(-\Omega(\varepsilon^2 d))$.

(Such an approach to intrinsic dimensionality is de- 
veloped in 0,0.) 

2.2.4. Size of a dataset

The number $n$ of datapoints grows faster than any polynomial function in $d$, but slower than any exponential function in $d$:

$n = d^{\omega(1)}$, $d = \omega(\log n)$. (4)

(This is a standard assumption in the asymptotic analysis of indexing schemes for similarity search, cf. [23]. An example of such a rate of growth is $n = 2^{\sqrt{d}}$.)

Note that randomly drawing a single dataset $X \subseteq \Omega$ with $n$ points amounts to randomly drawing a single point in the $n$-th power of the domain, $\Omega^n$, equipped with the product probability measure $\mu^{\otimes n}$. In order to perform asymptotic analysis of indexing scheme performance, we will in fact be choosing an infinite sequence of datasets $X_d \subseteq \Omega_d$, $d = 1, 2, \ldots$. This is equivalent to drawing a single point $x$ (sample path) in the infinite product

$\Omega_1^{n_1} \times \Omega_2^{n_2} \times \ldots \times \Omega_d^{n_d} \times \ldots$,

with regard to the corresponding infinite product of probability measures:

$x \sim \mu_1^{\otimes n_1} \otimes \mu_2^{\otimes n_2} \otimes \ldots \otimes \mu_d^{\otimes n_d} \otimes \ldots$.

When talking about confidence, we will mean the product probability in the above infinite product space. Specifically, a statement $Q(d, x)$ parametrized by the dimension $d$ and taking as a variable the sample path $x$ occurs with (asymptotically) high confidence if for every $\delta > 0$ there is $D$ so that

$P[Q(d, x) \text{ is true}] > 1 - \delta$

whenever $d \ge D$.

At the same time, in order to keep the notation simple, we will suppress the dimension index $d$ and talk just of a single domain $\Omega$ and a dataset $X \subseteq \Omega$.

2.3. Empty space paradox

Denote by $\varepsilon_{NN}$ the nearest neighbour distance function on $\Omega$, given by $\varepsilon_{NN}(\omega) = \rho(\omega, X)$.

Theorem 2.1. Under our standing assumptions on the workload, for every $\varepsilon > 0$ one has with asymptotically high confidence that for all points $\omega \in \Omega$ except for a set of measure $\exp(-\Omega(\varepsilon^2 d))$

$|\varepsilon_{NN}(\omega) - \mathrm{CharSize}(\Omega)| < \varepsilon$.

The result applies to the Hamming cube, the Euclidean cube, the Euclidean space with gaussian measure, the Euclidean ball, etc.

As a byproduct of the technique, one obtains:

Proposition 2.2. Under the same assumptions, for every $\varepsilon > 0$ the pairwise distances between datapoints of $X$ are all in the range $\mathrm{CharSize}(\Omega) \pm \varepsilon$ with asymptotically high confidence.

For constant $n = |X|$ and the case of a Euclidean domain the result was established in [22]. Our proofs can be found in Appendix A.



3. VC theory 

3.1. VC dimension 

Let $\mathscr{C}$ denote a collection of subsets of the domain $\Omega$. The VC dimension is an important measure of combinatorial complexity of $\mathscr{C}$. A finite set $A \subseteq \Omega$ is shattered by $\mathscr{C}$ if every subset $B \subseteq A$ can be "carved out" of $A$ with the help of a suitable element $C$ of $\mathscr{C}$:

$B = A \cap C$.

Figure 7: To the concept of a set $A$ shattered by a class $\mathscr{C}$.

The VC dimension of $\mathscr{C}$, denoted $\mathrm{VC}(\mathscr{C})$, is the supremum of cardinalities of all finite subsets of the domain which are shattered by $\mathscr{C}$. Here are some classical examples.
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of members of "jf 





Proofs can be found e.g. in Ch. 4. 
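
For small finite classes the definition of shattering can be checked by brute force. The sketch below is an illustration only: the concept class of discrete intervals and the helper names `is_shattered` and `vc_dimension` are hypothetical choices made for the example, not taken from the text.

```python
from itertools import combinations

def is_shattered(points, concepts):
    """True if every subset of `points` is cut out of it by some concept C."""
    traces = {frozenset(p for p in points if p in c) for c in concepts}
    return len(traces) == 2 ** len(points)

def vc_dimension(domain, concepts):
    """Largest size of a subset of the finite `domain` shattered by `concepts`."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(s, concepts) for s in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

domain = range(6)
# Hypothetical concept class: all discrete intervals {a, ..., b-1} in the domain,
# including the empty interval.
intervals = [frozenset(range(a, b)) for a in range(7) for b in range(a, 7)]
print(vc_dimension(domain, intervals))  # prints 2
```

Two points can be shattered by intervals, while for three points $i < j < k$ no interval contains $i$ and $k$ without containing $j$, so the value is 2.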

Estimating the VC dimension of a particular fam- 
ily of sets is often a non-trivial task. For example, 
the value of this parameter does not seem to be 
known for the collection of all cubes in $\mathbb{R}^d$ with sides
parallel to the coordinate hyperplanes. More gener- 
ally, it is tempting to conjecture that the VC dimen- 
sion of the family of all balls (either open or closed) 
in a Banach space of finite dimension d equals d+1, 
but the author is unaware of any results beyond the 
Euclidean case. 
3.2. Uniform convergence of empirical measures

Recall that the Borel sigma-algebra of subsets of a metric space $\Omega$ is the smallest family closed under countable intersections and complements and containing all open balls. Elements of the Borel sigma-algebra are called simply Borel subsets. We will restrict our attention to those families $\mathscr{C}$ whose elements are Borel subsets of $\Omega$. This assumption guarantees that the value $\mu(C)$ is well-defined for every probability measure $\mu$ on $\Omega$.

The empirical measure of $C \in \mathscr{C}$ with regard to a finite sample $X = \{x_1, \ldots, x_n\}$ is just the normalized counting measure

$\mu_n(C) = \frac{1}{n} |\{i : x_i \in C\}|$.

The VC dimension of $\mathscr{C}$ is finite if and only if, with high confidence, the empirical measures of every $C \in \mathscr{C}$ converge uniformly to the true value $\mu(C)$ as the sample size $n$ goes to infinity, no matter what the underlying measure $\mu$ is.
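
A minimal numerical illustration of uniform convergence follows. It assumes a specific setting not taken from the text: the domain is the interval [0, 1] with the uniform measure, and the class consists of all initial segments [0, t], a class of VC dimension 1. The name `sup_deviation` is hypothetical.

```python
import random

def sup_deviation(sample):
    """sup over t of |mu([0, t]) - mu_n([0, t])| for mu uniform on [0, 1]."""
    xs = sorted(sample)
    n = len(xs)
    # The supremum is approached at a sample point, just before or at the jump
    # of the empirical distribution function.
    return max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(xs))

rng = random.Random(2)
deviations = {}
for n in (100, 10_000):
    deviations[n] = sup_deviation([rng.random() for _ in range(n)])
    print(f"n = {n:6d}: sup_C |mu(C) - mu_n(C)| = {deviations[n]:.4f}")
```

The supremum is taken over the whole class at once, which is what distinguishes uniform convergence from the law of large numbers applied to one set at a time.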

Here is a more exact formulation. A class $\mathscr{C}$ has the property of uniform convergence of empirical measures, or is a uniform Glivenko-Cantelli class, if there is a function $s(\delta, \varepsilon)$ (sample complexity of the class) so that, given a desired precision value $\varepsilon > 0$ and a risk level $\delta > 0$, whenever $n \ge s(\delta, \varepsilon)$, one has

$\sup_{\mu \in \mathcal{P}(\Omega)} P\left( \sup_{C \in \mathscr{C}} |\mu(C) - \mu_n(C)| > \varepsilon \right) < \delta$.

Here $\mathcal{P}(\Omega)$ denotes the family of all probability measures on $\Omega$. We quote the following as stated in
[E3, Theorem 7.8. 

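Theorem 3.1 (Uniform Glivenko-Cantelli theorem). A concept class $\mathscr{C}$ is uniform Glivenko-Cantelli if and only if $d = \mathrm{VC}(\mathscr{C}) < \infty$, in which case

$s(\delta, \varepsilon) \le \max\left\{ \frac{8d}{\varepsilon} \lg \frac{8e}{\varepsilon},\ \frac{4}{\varepsilon} \lg \frac{2}{\delta} \right\}$.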



One of the components of the proof is the concentration of measure in the Hamming cube $\{0,1\}^n$.

Let us remark that similar results can be stated and proved for function classes, that is, collections $\mathscr{F}$ of functions from the domain $\Omega$ to the interval [0,1]. The role of VC dimension is taken over
by other combinatorial parameters, such as the fat 
shattering dimension. We will not enter into de- 
tails. 



Among a great selection of books treating VC 
theory, let us mention encyclopaedic sources 57 j 
and [lij], a classical monograph [H3|, and a lighter, 
but very well- written Q. 



4. The curse of dimensionality

4.1. Pivot tables

4.1.1. Reduction and access overhead

Let $(\Omega, \rho, X)$ be a similarity workload, $T$ a metric space, and $f\colon \Omega \to T$ a 1-Lipschitz function. If queries in $T$ are easier to process than in $\Omega$, then it makes sense, given a range query $(q, \varepsilon)$ in $(\Omega, X)$, to run a $(f(q), \varepsilon)$ range query in $(T, f(X))$, retrieving all datapoints $x$ with $f(x)$ within the distance of $\varepsilon$ of $f(q)$, and then check them against the condition $\rho_\Omega(q, x) \le \varepsilon$. The 1-Lipschitz property of $f$ guarantees that no true hits will be missed.

In this way, the function $f$ can be viewed as a projective reduction of the exact similarity search problem to the new workload $(T, f(X))$. This viewpoint is developed in some detail e.g. in [46]. The access overhead of the reduction $f$ is defined as

$\mathrm{acc}_f(q) = |X \cap f^{-1}(B_\varepsilon(f(q)))| - |X \cap B_\varepsilon(q)|$.

This simple and well-known idea on its own can be surprisingly efficient, cf. [50].

4.1.2. Pivot-based reduction to $\ell^\infty(k)$

Every finite collection $f_1, f_2, \ldots, f_k$ of 1-Lipschitz functions on $(\Omega, \rho)$ defines a 1-Lipschitz mapping $f = \Delta_{i=1}^{k} f_i$ from $\Omega$ to $\ell^\infty(k)$ via the formula

$f(x) = (f_1(x), f_2(x), \ldots, f_k(x))$.

Here $\ell^\infty(k)$ is the vector space $\mathbb{R}^k$ equipped with the norm $\|x\|_\infty = \max_{i=1}^{k} |x_i|$. If the $f_i$ are distance functions from pivot points $p_i \in \Omega$, the resulting mapping $f$ is of the form

$f(x) = (d(x, p_1), \ldots, d(x, p_k)) \in \ell^\infty(k)$. (5)
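
The pivot mapping (5) and the access overhead of the corresponding reduction can be sketched as follows. The example assumes a Euclidean domain and pivots drawn from the dataset; the function names `pivot_map`, `linf` and `access_overhead` are hypothetical.

```python
import math
import random

def pivot_map(x, pivots):
    """f(x) = (d(x, p_1), ..., d(x, p_k)), a point of l^infty(k)."""
    return [math.dist(x, p) for p in pivots]

def linf(u, v):
    """The l^infty distance max_i |u_i - v_i|."""
    return max(abs(a - b) for a, b in zip(u, v))

def access_overhead(dataset, pivots, q, eps):
    """Number of false hits: within eps of q in l^infty(k), but not in the domain."""
    fq = pivot_map(q, pivots)
    candidates = [x for x in dataset if linf(pivot_map(x, pivots), fq) <= eps]
    true_hits = [x for x in candidates if math.dist(x, q) <= eps]
    return len(candidates) - len(true_hits)

rng = random.Random(3)
d, n, k = 8, 300, 5
dataset = [[rng.random() for _ in range(d)] for _ in range(n)]
pivots = rng.sample(dataset, k)
q = [rng.random() for _ in range(d)]

# The 1-Lipschitz property: distances never grow under f (checked on a subsample,
# with a small tolerance for floating-point rounding).
lipschitz_ok = all(
    linf(pivot_map(x, pivots), pivot_map(y, pivots)) <= math.dist(x, y) + 1e-12
    for x in dataset[:40] for y in dataset[:40]
)
overhead = access_overhead(dataset, pivots, q, eps=0.6)
print("1-Lipschitz:", lipschitz_ok, "| access overhead:", overhead)
```

Because the mapping is 1-Lipschitz, every true hit is among the candidates, so the difference of the two counts is exactly the number of false hits that must be eliminated by distance computations in the original domain.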

In [BH ], it was suggested to use a reduction of this form in case where the distance computations in $\Omega$ are so expensive that even a simple sequential scan of the image $f(X)$ in $\ell^\infty(k)$ is computationally cheaper. This idea was analyzed for more general similarity measures than metrics in [16]. By combining it with other access methods on the space $\ell^\infty(k)$, further new indexing methods have been
developed, see e.g. 0. 
An $m$-NN similarity query is processed in $(\Omega, d, X)$ in time

$k + I + (\mathrm{acc}_f(q) + m)$.

Here the first term stands for the calculation of $k$ distances from a query point $q$ to the pivots and $I$ is the processing time of a rectangular query in $\ell^\infty(k)$, while the latter expression lists the number of distance computations in $\Omega$ needed to separate false hits from $m$ true positives. A classical paper
on optimizing the pivot selection is 0. 

4.1.3. Lower query time bounds for pivot tables

Our next result (a slightly corrected version of the main theorem in [Hi}) is valid not only for the Hamming cube, which is a testbed for asymptotic analysis of performance of indexing schemes, but also for the Euclidean space $\mathbb{R}^d$ with the gaussian measure, the cube $[0,1]^d$, and so forth.

Theorem 4.1. In addition to the assumptions of Subs. 2.2, suppose also that the VC dimension of the family of all balls in $\Omega$ is $O(d)$. Any pivot table with $k = o(n / d \log n)$ pivots will return an expected average number of $\Omega(n)$ datapoints. Consequently, the average total complexity of the performance of any pivot table for the resulting workload is $\Omega(n / d \log n)$.

Proof. Assume the number of pivots $k$ is $o(n / d \log n)$. Let $\varepsilon_M$ denote the median value of the function $\varepsilon_{NN}$, so that for at least half of the query points $q$ the distance to the NN in $X$ is $\ge \varepsilon_M$. For each pivot $p_i$, $i = 1, 2, \ldots, k$, denote by $\rho_i^M$ the median value of the distance function $\rho(p_i, -)$. Because of concentration, the measure of the spherical shell

$S_i = \{q : \rho_i^M - \varepsilon_M/2 < \rho(p_i, q) < \rho_i^M + \varepsilon_M/2\}$

is $1 - \exp(-\Omega(\varepsilon_M^2 d))$, and the complement to the intersection, $S = \cap_i S_i$, of all $k$ shells has measure

$o(n/d) \exp(-\Omega(\varepsilon_M^2 d)) = \exp(-\Omega(\varepsilon_M^2 d))$,

since $n$ is subexponential in $d$. Thus, among all $k$-fold intersections of spherical shells (Figure 8), we have found a giant one, whose $\mu$-measure is nearly one.

To assure that this intersection contains an accordingly high proportion of datapoints, consult the table in Subs. 3.1 to deduce that the family of all $k$-fold intersections of spherical shells in $\Omega$ has VC dimension not exceeding $2k \lg(ek)\, O(d) = o(n)$.

Figure 8: An intersection of spherical shells.

By Theorem 3.1, the empirical measure $\mu_n(S)$ approaches $\mu(S)$ and therefore 1 with high confidence as $d \to \infty$.

The measure of the set $Q$ of query points $q \in S$ whose distance to the nearest neighbour in $X$ is greater than or equal to $\varepsilon_M$ is at least $1/2 - \exp(-\Omega(\varepsilon_M^2 d))$. For every non-empty range query $(q, \varepsilon)$ where $q \in Q$, all datapoints belonging to $S$, that is, most datapoints of $X$, have to be returned. This gives an expected average total complexity $\Omega(n)$ under our assumption on the number of pivots. □

Notice that we allow the pivots $p_i$ to be arbitrary points of the domain $\Omega$. If we require that pivots be chosen from the dataset $X$, then the set $S$ in the above proof will with high confidence contain $n - k$ datapoints by Theorem 2.1 and Proposition 2.2, and we obtain (without using VC theory):

Corollary 4.2. Under the assumptions of Subs. 2.2, if all pivots $p_i$ belong to the dataset $X$, then the expected total complexity of the performance of the resulting pivot table is $n(1 - o(1))$.

4.1.4. A remark on results of [16]

The above lower bounds agree with an exponential in $d$ upper bound of $k + c^d$ derived in the influential paper [16], Theorem 3 within a similar model, with no restriction on a number $n$ of datapoints, and with $d$ a dimension parameter defined by a certain measure distribution density condition verified, e.g., by the Hamming cube $\{0,1\}^d$ or the Euclidean sphere $\mathbb{S}^d$. Here $c$ is a constant depending on $\Omega$, the smallest distortion parameter of a 1-Lipschitz embedding $f\colon \Omega \to \ell^\infty(k)$:

$\forall x, y \in X, \quad \|f(x) - f(y)\|_\infty \le \rho(x, y) \le c\, \|f(x) - f(y)\|_\infty$.
However, the usefulness of the result is limited because of an imprecise claim (loc. cit., Example 1) that for a bounded subset $X$ of $\ell^2(d)$ there always exists a 1-Lipschitz function $f\colon X \to \ell^\infty(d + 2)$ having distortion $c < 2$. In fact, an optimal constant here is on the order $\Omega(\sqrt{d})$ (see Appendix B). As a result, the query performance estimate for the Euclidean domains made in Remark after the main Theorem 3, loc. cit., becomes superexponential in $d$ and thus meaningless.

This misconception has led to some further con- 
fusion, cf. remarks made in Q (p. 2358, end of 
first paragraph on the r.h.s., and at the beginning 
of Section 5) . 

4.2. Hierarchical metric tree schemes

4.2.1. Metric trees

For a finite rooted tree $T$ we denote by $L(T)$ the set of leaves of $T$ and by $I(T)$ the set of inner nodes. The symbol $*$ will denote the root node of $T$.

Let $\mathscr{F}$ be a class of 1-Lipschitz functions on $\Omega$ (possibly partially defined). A metric tree (of type $\mathscr{F}$) for a workload $(\Omega, \rho, X)$ is a hierarchical indexing structure consisting of

• a finite binary rooted tree $T$,

• an assignment of a function $f_t \in \mathscr{F}$ (a pruning, or decision function) to every inner node $t \in I(T)$, and

• a collection of subsets $B_t \subseteq \Omega$, $t \in L(T)$ (bins), covering the dataset: $X \subseteq \bigcup_{t \in L(T)} B_t$.

Since we assume that the tree $T$ is binary, it can be identified with a sub-tree of the prefix tree, that is, a subset of binary strings $\varepsilon_1 \varepsilon_2 \ldots \varepsilon_k$, $0 \le k \le n$, where $\varepsilon_i = \pm 1$ for all $i$.

At each inner node $t = \varepsilon_1 \varepsilon_2 \ldots \varepsilon_k$ the value of the pruning function $f_t$ at the query centre $q$ is evaluated. The condition $f_t(q) > \varepsilon$ guarantees that the child node $t(-1) = \varepsilon_1 \varepsilon_2 \ldots \varepsilon_k(-1)$ need not be visited, because all elements $x$ of the bins indexed with the descendants of $t(-1)$ are at a distance $> \varepsilon$ from $q$. Indeed, assuming $x \in B_\varepsilon(q)$, one has

$|f_t(x) - f_t(q)| \le d(x, q) \le \varepsilon$.

Similarly, if $f_t(q) < -\varepsilon$, then the node $t1 = \varepsilon_1 \varepsilon_2 \ldots \varepsilon_k 1$ can be pruned, because no bin labelled with descendants of $t1$ can possibly contain a point within the range $\varepsilon$ from $q$.

However, if $f_t(q) \in [-\varepsilon, \varepsilon]$, then no pruning is possible and both children nodes of $t$ have to be visited. The search branches out. In the presence of concentration, the amount of branching is considerable, and results in dimensionality curse.
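
The pruning rule can be made concrete with a small vantage-point style tree. The sketch rests on assumptions that go beyond the text: the decision function at a node is $f_t(x) = \rho(p, x) - r$ for a randomly chosen pivot $p$ and the median radius $r$, the bins below the child $t(-1)$ consist of points with $f_t \le 0$, and the bins below $t1$ consist of points with $f_t > 0$. The names `Node`, `build` and `search` are hypothetical.

```python
import math
import random

class Node:
    def __init__(self, f=None, inside=None, outside=None, bucket=None):
        self.f, self.inside, self.outside, self.bucket = f, inside, outside, bucket

def build(points, rng, leaf_size=8):
    """Build a binary metric tree; leaves store bins of datapoints."""
    if len(points) <= leaf_size:
        return Node(bucket=points)
    p = rng.choice(points)
    r = sorted(math.dist(p, x) for x in points)[len(points) // 2]
    f = lambda x, p=p, r=r: math.dist(p, x) - r   # a 1-Lipschitz decision function
    inside = [x for x in points if f(x) <= 0]      # child t(-1)
    outside = [x for x in points if f(x) > 0]      # child t(+1)
    if not inside or not outside:
        return Node(bucket=points)                 # degenerate split: keep a leaf
    return Node(f, build(inside, rng, leaf_size), build(outside, rng, leaf_size))

def search(node, q, eps, visited):
    """Range query; `visited` collects the bins that had to be opened."""
    if node.bucket is not None:
        visited.append(node)
        return [x for x in node.bucket if math.dist(q, x) <= eps]
    fq, hits = node.f(q), []
    if fq <= eps:     # f_t(q) > eps would rule out the whole child t(-1)
        hits += search(node.inside, q, eps, visited)
    if fq >= -eps:    # f_t(q) < -eps would rule out the whole child t(+1)
        hits += search(node.outside, q, eps, visited)
    return hits

rng = random.Random(4)
d, n = 4, 400
data = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
tree = build(data, rng)
q = tuple(rng.random() for _ in range(d))
visited = []
hits = search(tree, q, 0.25, visited)
brute = [x for x in data if math.dist(q, x) <= 0.25]
print(len(hits), "hits;", len(visited), "bins visited")
```

When the values $f_t(q)$ concentrate within $[-\varepsilon, \varepsilon]$, both branches are followed at most nodes and the number of visited bins approaches the total number of bins.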

The M-tree [l(| is by now a classical example of 
a metric tree. However, metric tree- type indexing 
schemes are very numerous, cf. Sections 2.1-2.4 in 
[Hi] or Section 4.5 in 0- 

4.2.2. Lower bounds for metric trees

For a function $f$ and a real number $t$, denote $[f < t] = \{x \in \mathrm{dom}(f) : f(x) < t\}$.

Theorem 4.3. In addition to the assumptions of Subs. 2.2, let $\mathscr{F}$ be a class of 1-Lipschitz functions on the domain $\Omega$ such that the VC dimension of the family of sets $[f < t]$, $f \in \mathscr{F}$, $t \in \mathbb{R}$, is poly(d). Then the expected average performance of every metric tree indexing structure of type $\mathscr{F}$ is superpolynomial in $d$.

That the above combinatorial assumption on the class $\mathscr{F}$ is sensible follows from a theorem of Goldberg and Jerrum [19]. Consider a parametrized class

$\mathscr{H} = \{x \mapsto f(\theta, x) : \theta \in \mathbb{R}^s\}$

for some $\{0,1\}$-valued function $f$. Suppose that, for each input $x \in \mathbb{R}^s$, there is an algorithm that computes $f(\theta, x)$, and this computation takes no more than $t$ operations of the following types:

• the arithmetic operations $+$, $-$, $\times$ and $/$ on real numbers,

• jumps conditioned on $>$, $\ge$, $<$, $\le$, $=$, and $\ne$ comparisons of real numbers, and

• output 0 or 1.

Then $\mathrm{VC}(\mathscr{H}) \le 4s(t + 2)$.

Essentially, the above result states that a class of binary functions that can be computed in polynomial time taking a parameter value of polynomial length will have a polynomial VC dimension.

On the proof of Theorem 4.3. (For details, see [ia].) Suppose the conclusion is false, and fix a particular poly(d) rate, $f(d)$, bounding from above the performance of a metric tree on any sample path. As the total content of bins $B_t$ indexed with strings $t$ of length exceeding the rate $f(d)$ has to be asymptotically negligible, we can assume without loss of generality that the indexing tree has depth $f(d)$.

Without loss of generality every bin can be replaced with an intersection of a family of sets of the form $[f < t]$, $f \in \mathscr{F}$, and their complements. This provides a poly(d) upper bound on the VC dimension of the family of all possible bins.
With high confidence, a bin of a large measure will contain many data points, contradicting the poly(d) performance bound. This leads one to conclude that measures of bins cannot be too skewed. Now concentration of measure is used to prove that at least poly(d) bins $B_t$ have size so large that the $\varepsilon_M$-neighbourhood of $B_t$ has almost full measure. One deduces further that query centres $q$ whose $\varepsilon_{NN}$-neighbourhood meets at least $d^{\omega(1)}$ bins have measure $\ge 1/2 - o(1)$. Processing a nearest neighbour query with such a centre $q$ requires accessing all of these bins, if only to verify that some of them are empty. This leads to a contradiction with the assumed uniform performance bound on the algorithm. □

4.3. The curse of dimensionality conjecture

4.3.1. The problem

Of course the above are just particular results only applicable to specific indexing schemes. If one wants to validate the curse of dimensionality once and for all, here is an interesting open problem.

Conjecture 4.4 (cf. [23]). Let $X$ be a dataset with $n$ points in the Hamming cube $\{0,1\}^d$. Suppose $d = n^{o(1)}$ and $d = \omega(\log n)$. Then any data structure for exact nearest neighbour search in $X$, with $d^{O(1)}$ query time, must use $n^{\omega(1)}$ space.

The data structure and algorithm are understood 
in the sense of the cell probe model of computation 

(cf. [Hi). 

4.3.2. Cell probe model

In the context of similarity search, the model can be described as follows. An abstract indexing structure for a domain $\Omega$ consists of

• a collection of cells $C_i$, indexed with a set $I$,

• a dictionary $T = W^*$ over an alphabet $W = \{0,1\}^b$, viewed as a rooted prefix tree,

• a computable mapping $t \mapsto i(t)$ from $T$ to $I$ (cell selector), and

• a computable function $f = f_t(\sigma; q)$ (either partially or fully) defined on $T \times \{0,1\}^b \times \Omega$ and taking values in $W$.

For a $t \in T$, one can think of each $f_t$ as a function defined on a subset of $\Omega$ and taking a $b$-bit string $\sigma$ as a parameter, except if $t = *$ is the root. A value $f_t(\sigma; q)$ is a child $s$ of the node $t$.

For every $i$, the cell $C_i$ can hold a $b$-bit string. Sometimes $b$ is regarded as constant, but often it is assumed that $b = O(\lg n)$, so that a cell corresponding to a leaf node can store a pointer to a datapoint $x \in X$. Occasionally the nearest neighbour problem is replaced with a weaker decision version (known as near neighbour problem), whereby a range parameter $\varepsilon_0 > 0$ is fixed and the algorithm is expected to tell whether there is an $x \in X$ at a distance $< \varepsilon_0$ from the query point. In such a case, a leaf node cell $C_i$ will hold a single bit (a "yes" or "no" answer).

Building the data structure at the preprocessing stage, given a dataset $X$, consists in storing in every node cell a $b$-bit string.

A memory image of the indexing structure $C_i$, $i \in I$, is created when the algorithm is initialized. Given a query point $q \in \Omega$, the prefix tree $T = W^*$ is traversed down to the leaf level beginning with the root. At the inner node $t$, the content $\sigma$ of the cell $C_{i(t)}$ is read and passed on to the function $f_t$ as a parameter. The computed value $f_t(\sigma; q) = s \in W = \{0,1\}^b$ indicates a child of $t$ to follow at the next step. When a leaf $l$ is reached, the algorithm halts and returns the contents of $C_{i(l)}$. The query time is the length of the branch traversed, or equivalently the number of cell probes during the execution of the algorithm. The space requirement of the model is the total number of cells, $|I|$.

The cell probe model is very liberal, as the cost of computing the values of $f$ is disregarded. For this reason, any lower bound obtained under the cell probe model will likely hold under any other model of computation.

4.3.3. Current state of the problem

The best lower bound currently known is $\Omega(d / \log\frac{sd}{n})$, where $s$ is the number of cells used by the data structure [41]. In particular, this implies the earlier bound $\Omega(d / \log n)$ for polynomial space data structures Q, as well as the bound $\Omega(d / \log d)$ for near linear space (namely $n \log^{O(1)} n$).

5. Approximate NN search and dimensionality reduction

Approximate nearest neighbour search jiol ] is often said to be free from the curse of dimensionality, and the reason is that the (dimensionality) reduction maps $f$ used in indexing are no longer 1-Lipschitz. Rather, they are what may be called "probably approximately 1-Lipschitz", and sometimes only on a certain distance scale. Such maps no longer exhibit a strong concentration around their means. The price to pay is that we may lose some relevant datapoints, as some distances are typically getting distorted, and so such maps cannot be used for exact NN search.

5.1. Random projections in the Hamming cube 

Think of the Hamming cube $\{0,1\}^d$ as the set of
all binary functions in the space $\ell^1(d) = L^1([d])$,
where $[d] = \{1,2,\ldots,d\}$ supports a uniform measure.
In other words, we normalize the Hamming
distance $d(x,y) = |\{i : x_i \neq y_i\}|$ by multiplying it by
$1/d$. Of course such a normalization has no effect
on similarity search. If the dataset $X \subseteq \{0,1\}^d$
contains $n$ points, then the VC dimension of $X$,
viewed as a concept class on $\{1,2,\ldots,d\}$, does not
exceed $\log_2 n$. According to the uniform Glivenko-Cantelli
Theorem 3.1, if $O(\varepsilon^{-2}\log_2 n)$ coordinates of
the Hamming cube are chosen at random, then with
high confidence the restriction mapping from $X$ to
the Hamming cube $\{0,1\}^{O(\varepsilon^{-2}\log_2 n)}$ (under its own
normalized Hamming distance) preserves the pairwise
distances to within $\pm\varepsilon$. Cf. Figure 9.
Figure 9: Histogram of distortions of all pairwise distances
in a random dataset of n = 3,000 points in the d = 500
Hamming cube under a projection to a Hamming cube on
randomly chosen k = 25 bits.
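The coordinate sampling behind this histogram is simple to reproduce on a small scale. The sketch below is a hedged illustration rather than the code used for the figure: the helper names are hypothetical, coordinates are drawn without replacement (an assumption, the text only says "at random"), and the sample size $k$ is left as a free parameter instead of being derived from $\varepsilon$ and $n$.

```python
import random


def normalized_hamming(x, y):
    """Hamming distance divided by the number of coordinates."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def sample_coordinates(d, k, rng):
    """Choose k distinct coordinates of the d-cube uniformly at random."""
    return rng.sample(range(d), k)


def restrict(x, coords):
    """Restriction of a binary vector to the chosen coordinates."""
    return [x[i] for i in coords]


def max_additive_distortion(points, coords):
    """Largest |distance after restriction - distance before| over all pairs."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            full = normalized_hamming(points[i], points[j])
            proj = normalized_hamming(restrict(points[i], coords),
                                      restrict(points[j], coords))
            worst = max(worst, abs(full - proj))
    return worst
```

Running `max_additive_distortion` for growing values of `k` on a fixed random dataset gives a feeling for how the additive error shrinks as more coordinates are kept.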

The error of $\pm\varepsilon$ is additive rather than multiplicative,
so the random sampling of the coordinates is only appropriate
for ANN search in the range on the order of $d/2$. The
construction has to be generalized for all possible ranges
$\ell = 1, 2, \ldots, d$. Such a generalization was developed in [29].

Projecting on a randomly sampled subset of $k$ coordinates
of the Hamming cube essentially amounts
to a linear transformation $x \mapsto xA$, where $A$ is a
$d \times k$ matrix with i.i.d. Bernoulli entries assuming
values 1 and 0 with probabilities $1/d$ and $1 - 1/d$,
respectively. (The operations are carried out mod 2.) One
of the key observations of [29], in the form given
to it in [54], 7.2, is that if the probability $1/d$ is
replaced with $1/\ell$, then a random linear transformation
$x \mapsto xA \bmod 2$, under a suitable normalization,
preserves distances on the scale $\ell/2$, $\ell = 1, 2, \ldots, d$,
to within an additive error $\pm\varepsilon$, and keeps distances on
a larger scale away from it. Since the new cube only
contains $2^{O(\varepsilon^{-2}\log_2 n)}$ points, a hash table storing
nearest neighbours, together with the reduction map
$f$, produces an indexing scheme for $\ell$-range search
taking space polynomial in $n$ and answering
$(1+\varepsilon)$-approximate queries in time $O(\varepsilon^{-2}\log_2 n)$.

Another discovery of [29] is that if on every scale
$\ell$ one employs a sufficiently large series of independent
projections onto $k$-cubes, then with high confidence
one can assure that every ANN query (as
opposed to most ANN queries) will be answered
correctly. Finally, a separate indexing scheme is
constructed for every range $\ell$. The overall space
requirement is still polynomial in $n$, and the running
time of the algorithm is $O(d\,\mathrm{poly}\log(dn))$.

5.2. Random projections in the Euclidean space 

Let $S^{d-1}$ denote the Euclidean sphere of unit radius
in the space $\mathbb{R}^d$. The projection $\pi_1$ on the first
coordinate is a 1-Lipschitz function. For all pairs
of points $x, y \in S^{d-1}$, one has
$|\pi_1(x) - \pi_1(y)| \le \|x - y\|$, and for exactly one pair of
antipodal points the equality is achieved. Now let
$x, y \in S^{d-1}$ be drawn at random. What is the expected value of
the distortion of distances $|\pi_1(x) - \pi_1(y)| / \|x - y\|$?

Figure 10 shows that for a vast majority of pairs
of points, the projection distorts distances by the
factor $\Theta(1/\sqrt{d})$. A geometric explanation, at least
at an intuitive level, is simple. Two randomly chosen
points on the high-dimensional sphere, because
of concentration of measure, are at a distance $\approx \sqrt{2}$
from each other. At the same time, half of the
points of the sphere project on the interval of length
$O(1/\sqrt{d})$, and so are contained in the equatorial
region (Figure 11).
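The order of magnitude of this distortion can be checked numerically. The sketch below is a hedged illustration with hypothetical function names: it draws uniform points on $S^{d-1}$ by normalizing standard gaussian vectors and averages the ratio $|\pi_1(x) - \pi_1(y)| / \|x - y\|$ over independent random pairs. The number of pairs is a free parameter, not a value taken from the figure.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform point on the unit sphere of R^d (normalized gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def mean_first_coordinate_distortion(d, pairs, rng):
    """Average of |x_1 - y_1| / ||x - y|| over independent random pairs."""
    total = 0.0
    for _ in range(pairs):
        x = random_unit_vector(d, rng)
        y = random_unit_vector(d, rng)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        total += abs(x[0] - y[0]) / dist
    return total / pairs
```

Multiplying the returned value by $\sqrt{d}$ for several values of $d$ should give roughly the same number each time, which is the scaling discussed above.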

Figure 10: The expected distortion of one-dimensional
projection of the $d$-dimensional sphere $S^{d-1}$ over all pairs of
points.

It follows that the expected absolute value of
the norm of a projection of a given point $x$ in a
random direction is of the order $\Theta(1/\sqrt{d})$. Now
let $X = \{x_1, \ldots, x_n\}$ be a finite subset of points
of the sphere. Denote by $Y$ the set of all vectors
of the form $x_i - x_j$ whose length is normalized
to one. Each $y \in Y$ can be identified with the
function $y\colon z \mapsto |\langle y, z\rangle|$ on the unit sphere. If we
now think of $S^{d-1}$ as the domain (consisting of
one-dimensional projections), then $Y$ plays the role of
a finite function class. Just like for finite concept
classes, the combinatorial dimension of $Y$ is of the
order $O(\log n)$, and so, by VC theory, the empirical
mean on a random sample of $O(\varepsilon^{-1}\log n)$ directions
will estimate the expectations of all $y \in Y$
to within a factor of $\varepsilon$ with high confidence. A small
number of randomly chosen directions are likely to
be nearly pairwise orthogonal because of concentration,
so we can instead choose an orthogonal projection
to a randomly chosen space of dimension
$O(\varepsilon^{-1}\log n)$. Since the projection is a linear map,
we get the same estimate, but with a multiplicative
error $\varepsilon$, for all pairwise distances between the
points of $X$. It remains to work out the meaning of
the empirical mean in the above setting in order to
obtain the following famous result.

Theorem 5.1 (Johnson-Lindenstrauss lemma
[25]). Let $\varepsilon \in (0, 1/2)$ be a real number, and
$X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ points in $\mathbb{R}^n$. Let $k$
be an integer with $k \ge C\varepsilon^{-2}\log n$, where $C$ is a
sufficiently large absolute constant. Then there is a
mapping $f\colon \mathbb{R}^n \to \mathbb{R}^k$ such that

$$(1 - \varepsilon)\,\|f(x_i) - f(x_j)\| \le \|x_i - x_j\| \le (1 + \varepsilon)\,\|f(x_i) - f(x_j)\|$$

for all $i, j = 1, 2, \ldots, n$. Moreover, as $f$, one can
with high confidence choose a suitably renormalized
random projection from $\mathbb{R}^n$ to a $k$-dimensional
Euclidean subspace.
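A quick numerical experiment illustrates the statement. The sketch below deliberately swaps the orthogonal projection of the theorem for a matrix of i.i.d. gaussian entries scaled by $1/\sqrt{k}$, a common variant that is easier to write down, so it should be read as an illustration under that assumption and not as the construction in the theorem. All function names are hypothetical.

```python
import math
import random


def gaussian_projection(d, k, rng):
    """A k x d matrix of i.i.d. N(0, 1/k) entries (stand-in for a random projection)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def apply_map(matrix, x):
    """Image of the vector x under the linear map given by the matrix."""
    return [sum(r * c for r, c in zip(row, x)) for row in matrix]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distortion_range(points, matrix):
    """Smallest and largest ratio ||f(x) - f(y)|| / ||x - y|| over all pairs."""
    images = [apply_map(matrix, p) for p in points]
    ratios = [euclidean(images[i], images[j]) / euclidean(points[i], points[j])
              for i in range(len(points)) for j in range(i + 1, len(points))]
    return min(ratios), max(ratios)
```

For a fixed dataset, the interval returned by `distortion_range` tightens around 1 as `k` grows, in line with the dependence of $k$ on $\varepsilon$ in the theorem.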

An even simpler proof using concentration can
be found in [31], Section 15.2, and an up-to-date
survey of the lemma in [32].

The normalized projection is not quite as good as
a genuine 1-Lipschitz map, because the distortion
of a distance can exceed one, and on rare occasions
very considerably. Yet, as a reduction mapping for
approximate NN search, the projection map is perfectly
adequate, and its histogram is no longer concentrated.
This explains the efficiency of the random projection
method for approximate NN search. Combined
with a suitable indexing scheme in a lower-dimensional
space $\mathbb{R}^k$, or rather a collection of such
schemes, the random projection method leads to an
efficient indexing scheme for a $(1+\varepsilon)$-approximate
NN search (Indyk and Motwani [23]).

The articles [2i| and [24[ have appeared inde- 
pendently and at about the same time, and after- 
wards the dimensionality reduction methods have 
been shown [l| to be near optimal in the cell probe 
model. 






6. Concluding remarks 

6.1. Intrinsic dimensionality 

Merits of asymptotic analysis of indexing algorithms
using artificial datasets sampled from theoretical
high-dimensional distributions should be
clear from [38]. At the same time, it is an often
held belief that real data does not have very
high intrinsic dimension. This corresponds to the
existence of 1-Lipschitz functions that are highly
dissipating. Figure 12 shows the distribution of distances
to the points of the SISAP benchmark dataset
of NASA images, $X \subset \ell^2(20)$, consisting of 40,149 vectors in
a 20-dimensional Euclidean space [6, 49], from a
highly dissipating pivot, selected from a gaussian
cloud around $X$ with standard deviation on the order
of the tolerance range $\varepsilon = 0.275$ retrieving on
average 0.1% of data. This has to be compared to
Figure El 
Figure 12: Empirical density histogram of distances from a
pivot having the highest found value of dissipation for the
NASA dataset. Vertical lines mark the mean ± tolerance
range $\varepsilon = 0.275$. The $\varepsilon$-dissipation (0.747) is the area outside
of extreme lines.

Of a great variety of approaches to intrinsic di- 
mension [l3j], at least two specifically measure the 
amount of concentration in data. The first one is 
the intrinsic dimension by Chavez et al. Q 



$$\dim_{dist}(X) = \frac{1}{2\,\mathrm{var}(d)}. \qquad (6)$$



The second is the concentration dimension, studied 
within an axiomatic approach of 4^, 44 [ : 



$$\dim_\alpha(X) = \frac{1}{\left[2\int_0^1 \alpha_X(\varepsilon)\,d\varepsilon\right]^2}. \qquad (7)$$



(In both cases we assume that CharSize(X) = 1.)
The value (7) is convenient for asymptotic analysis
in the spirit of this paper, but is nearly impossible
to estimate for a given dataset. On the other
hand, (6) is readily calculated by sampling (e.g.
$\dim_{dist}(X) = 5.18$ for NASA images) and forms
a good statistical estimator for the dimension of
the hypothetical underlying measure $\mu$ in the most
(only?) interesting case where metric balls have low
VC dimension. The shortcoming of (6) is that the
parameter estimates the concentration/dissipation
behaviour of a typical pivot distance function, while
it is a few most dissipating pivots that really matter
for indexing. One may envisage the emergence of
further concepts of intrinsic dimension in the same
spirit, such as the local dimension of Ollivier [39],
Definition 3.
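The remark that (6) is readily calculated by sampling can be made concrete. The sketch below is a hedged illustration: the function name is hypothetical, and instead of rescaling the data it uses the scale-invariant form $\mu^2 / (2\sigma^2)$, where $\mu$ and $\sigma^2$ are the mean and the variance of the sampled distances. This agrees with $1/(2\,\mathrm{var}(d))$ under the assumption that the characteristic size is taken to be the mean distance and is normalized to one.

```python
import random
import statistics


def intrinsic_dimension(points, dist, pairs, rng):
    """Estimate mu^2 / (2 sigma^2) from distances between randomly sampled pairs."""
    sampled = []
    for _ in range(pairs):
        x, y = rng.sample(points, 2)   # two distinct datapoints
        sampled.append(dist(x, y))
    mu = statistics.fmean(sampled)
    var = statistics.pvariance(sampled)
    return mu * mu / (2.0 * var)
```

For uniformly random points of the Hamming cube $\{0,1\}^d$ with the Hamming distance, the distances are binomial with mean $d/2$ and variance $d/4$, so the estimate should come out close to $d/2$.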

6.2. Black box search model and Urysohn space 

The black box model of similarity search was studied
by Krauthgamer and Lee [28]. Given a metric
space (instance) $(X, d)$, a query is a one-point metric
space extension $X \cup \{q\}$, where the distances
$d(q, x)$, $x \in X$, are accessible via the distance oracle.
Each $d(q, x)$ can be evaluated in constant (unit)
time. A preprocessing phase is allowed, under the
condition that an indexing scheme occupies $\mathrm{poly}(n)$
space. The efficiency of an algorithm for (exact
or approximate) similarity search is estimated as the
number of calls to the distance oracle necessary to
answer a query.

This is a "black box model" in the sense that, formally
speaking, there is no obvious domain (though
we will see shortly that the domain is a well-defined
separable metric space, and the setting is, in fact,
classical). A remarkable feature of the model is that
the problem of characterizing workloads admitting
approximate NN queries in terms of an intrinsic
dimension parameter receives a complete answer.

Recall that the Assouad (or doubling) dimension
of a metric space $(X, d)$ is the minimum value $p > 0$
such that every set $A$ in $X$ can be covered by $2^p$
balls of half the diameter of $A$. (The diameter of a
set $A$ is the supremum of $d(x, y)$, $x, y \in A$.) Denote
this parameter by dimdu(X). 
Theorem 6.1 (Krauthgamer and Lee [28]). A
metric space $(X, d)$ admits an algorithm requiring
$\mathrm{poly}(n)$ space and taking $\mathrm{polylog}(n)$ time to
answer a $(1 + \varepsilon)$-approximate nearest neighbour query,
where $\varepsilon < 2/5$, if and only if

dim dM (X,d) = O (log log n). 

Here we will show that, on the contrary, an exact 
NN search in this context exhibits the curse of di- 
mensionality even if the metric space (X, d) is con- 
tained in the unit interval [0, 1] with the usual dis- 
tance. With this purpose, we first convert the black 
box model into a conventional setting of searching 
in a metric domain. 

The universal Urysohn metric space, u, USUI is 
a complete separable metric space uniquely defined
by the one-point extension property: suppose $X$ is
a finite subset of $U$ and $q$ a one-point metric space
extension of $X$. Then $U$ contains a point $q'$ so that
the distances from $q$ and from $q'$ to any point $x \in X$
are the same.
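A one-point extension of a finite metric space $X$ is determined by the distances $f(x) = d(q, x)$, and the triangle inequality holds in $X \cup \{q\}$ exactly when $|f(x) - f(y)| \le d(x, y) \le f(x) + f(y)$ for all $x, y \in X$. The following sketch checks this condition mechanically. It is an illustration only: the function name `is_katetov` and the tolerance parameter are hypothetical, and the example points are taken in the Euclidean plane for convenience.

```python
import math


def is_katetov(points, dist, f, tol=1e-12):
    """True if |f(x) - f(y)| <= d(x, y) <= f(x) + f(y) for all pairs of points."""
    for x in points:
        for y in points:
            dxy = dist(x, y)
            if abs(f(x) - f(y)) > dxy + tol:
                return False   # f varies faster than the metric allows
            if dxy > f(x) + f(y) + tol:
                return False   # the new point would be too close to both x and y
    return True


# Usage: three points in the plane and a few candidate distance functions.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = (0.5, 2.0)
from_query = lambda x: math.dist(query, x)   # distances from an actual point
all_ones = lambda x: 1.0                     # a new point at distance 1 from everything
too_close = lambda x: 0.1                    # violates d(x, y) <= f(x) + f(y)
```

Here `from_query` and `all_ones` pass the test, while `too_close` fails it, since two points at distance 1 cannot both lie within 0.1 of a common third point.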




Figure 13: One-point extension property. 



An equivalent definition is that if $X \subset U$ is finite
and $f\colon X \to \mathbb{R}$ satisfies

$$|f(x) - f(y)| \le d_X(x, y) \le f(x) + f(y) \qquad (8)$$

for all $x, y \in X$, then there is $q \in U$ with $f(x) = d(q, x)$
for all $x \in X$. (The functions satisfying (8)
are called Katětov functions.)

This remarkable object has recently received
plenty of attention in metric geometry. It is a random,
or generic, metric space, in the sense that by
equipping the integers with a randomly chosen metric
$\rho$ and taking a completion, one obtains $U$ almost
surely [55]. The space $U$ contains an isometric copy
of every separable metric space $\Omega$. For this reason,
one can use $U$ as a "universal domain," and
the black-box model can be restated as a classical
similarity search problem in the domain $\Omega = U$.

Theorem 6.2. Let $X$ be a finite metric space. Denote
$n = |X|$. Then any deterministic algorithm
for exact similarity search in $X$ within the black
box model will take the worst case time $n$.

The result is true for simple information-theoretic
reasons. We will produce for every $k < n$ a query
$q''$ with a uniquely defined nearest neighbour in $X$
which cannot be answered in time $k$.

Without loss of generality, we can assume that
$\mathrm{diam}(X) = 1$. Let initially $q$ be a query having
the property that $d(q, x) = 1$ for all $x \in X$. Suppose
that the algorithm has made $k < n$ calls to
the distance oracle. Denote by $x_1, x_2, \ldots, x_k \in X$ the
points whose distance to $q$ has been accessed. Since
$d(q, x_i) = 1$ for all $i \le k$, the algorithm clearly cannot
halt at this stage. Let $Q$ be the set of all $q' \in U$
with $d(q', x_i) = 1$ for all $i = 1, 2, \ldots, k$. Since the
algorithm is deterministic, we can replace $q$ with
any $q' \in Q$, and the sequence of executed calls to
the oracle up until the step $k$ will be the same.

Now denote $Y = \{x_1, x_2, \ldots, x_k\}$ and fix an
$x_0 \in X \setminus Y$. The function

$$f(x) = \max\{1 - d(x, Y),\ d(x_0, Y) - d(x, x_0)\}$$

is Katětov, and thus it is the distance function
from some $q''$. Clearly, $q'' \in Q$, and $q''$ admits a
unique nearest neighbour in $X$, namely $x_0$. Thus,
the search cannot be concluded in $k$ steps even if it
started with the well-defined query $q''$. □

If one requires the queries to follow the same underlying
distribution as datapoints, the problem becomes
more subtle, and we do not know the answer.



6.3. Indexing via Delaunay graph 

Here is an example of an indexing scheme for ex- 
act similarity search which is still "distance-based" 
but of a rather different type from either pivots or 
metric trees. 

The Voronoi cell $V(x)$ of a datapoint $x \in X$ in a
metric domain $\Omega$ consists of all points $q \in \Omega$ having
$x$ as the nearest neighbour. The Delaunay graph
has $X$ as the set of vertices, with $x, y$ being adjacent
if their Voronoi cells intersect. Suppose the domain
has the property that every two points $x, y \in \Omega$ can
be joined by a shortest geodesic path, not necessarily
unique. (All the domains previously considered
in this article are such, including even the Urysohn
space.) Then for any $q \in \Omega$ and $x \in X$, either $x$ is
the nearest neighbour to $q$, or else one of the datapoints
$y$ Delaunay-adjacent to $x$ is strictly closer to
$q$ than $x$ is. (Proof: start moving along a shortest
geodesic from x towards q, cf. Figure H3J and use 
the triangle inequality.) 
Figure 14: NN search using Delaunay graph.

This observation turns the Delaunay graph of
$X$ in $\Omega$ into an indexing scheme for exact nearest
neighbour search. Denote by $S_x$ the list of points
adjacent to each $x \in X$. Given a query $q$, start with
an arbitrary $x_0 \in X$, and find

$$x_1 = \arg\min_{y \in S_{x_0} \cup \{x_0\}} d(q, y).$$

If $x_1 \neq x_0$, move to $x_1$ and repeat the procedure.
Once $x_{i+1} = x_i$, the algorithm halts and returns
$x_i$. This algorithm, already mentioned in [12], was
studied for general metric spaces by Navarro [37].
See also 0, 4.1.6. 
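The search procedure above can be sketched in a few lines. The code below is a hedged illustration with hypothetical names, run on the simplest possible domain, the real line, where the Voronoi cells of sorted datapoints are consecutive intervals and the Delaunay graph is therefore a path. Computing Delaunay neighbours in a general metric domain is a separate problem that this sketch does not address.

```python
def greedy_nn_search(q, start, neighbours, dist):
    """Move to the closest point among the current one and its neighbours until stuck.

    Returns the final datapoint and the number of moves made."""
    current, hops = start, 0
    while True:
        candidates = [current] + list(neighbours[current])
        best = min(candidates, key=lambda y: dist(q, y))
        if dist(q, best) >= dist(q, current):
            return current, hops       # no neighbour is strictly closer: halt
        current, hops = best, hops + 1


def path_neighbours(sorted_points):
    """Delaunay adjacency lists for distinct sorted points on the real line."""
    last = len(sorted_points) - 1
    return {p: [sorted_points[j] for j in (i - 1, i + 1) if 0 <= j <= last]
            for i, p in enumerate(sorted_points)}
```

With datapoints 0, 1, 3, 7, 12 and the query 6.2, a search started at 0 visits 1, 3 and 7 and halts there. The cost of each step is the length of the neighbour list, which is why the average degree of the graph matters in what follows.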

In order for the algorithm to be efficient, the average
vertex degree of the Delaunay graph has to
be small. Navarro observed (loc. cit., Theorem
1) that this is not the case in general metric spaces.
Specifically, he proved that for every two elements
$a, b$ of a finite metric space $X$ there exists a finite
metric space $Y = Y_{a,b}$ containing $X$ as a subspace
in which $a, b \in X$ are connected in the Delaunay
graph of $X$. The result by Navarro translates
immediately into:

Theorem 6.3. Let $X$ be a finite metric subspace of
the universal Urysohn space $U$. Then every two elements
$a, b \in X$ are adjacent in the Delaunay graph
of $X$ in $U$.

In fact, the same remains true in less exotic situations,
as one can deduce from Proposition 2.2 that
if $\Omega$ is either $\mathbb{R}^n$, or the sphere $S^n$, or the Hamming
cube, then under the assumptions of Subs. 2.2 the
Delaunay graph of $X$ is, with high confidence, a
complete graph on $n$ vertices.

Thus, the indexing scheme in question still suffers
from the curse of dimensionality because of concentration
of measure considerations, but the argument
seems to be of a different nature from that either
for pivots or for trees. What would a common
proof for all three types of schemes look like? This
highlights the difficulty of obtaining in a uniform
way lower bounds for all possible "distance-based"
indexing schemes (after they are formalized in a
suitable way), not to mention an even more general
setting of the cell probe model for all possible
indexing schemes.

That said, for real data the complexity
of the Delaunay graph is lower than in an artificial
asymptotic setting, and Voronoi diagrams are being
successfully used for data mining algorithms in high
dimensions, cf. [51].

In fact, it would be interesting to investigate the
performance of the spatial approximation algorithm
in hyperbolic metric spaces. Recall that a metric
space $X$ in which every two points $x, y$ can be joined
by a geodesic segment $[x, y]$ is hyperbolic (in the
sense of Rips) [52] if there exists a $\delta > 0$ so that
every geodesic triangle is $\delta$-thin: each side $[x, y]$ is
contained in the $\delta$-neighbourhood of the two other
sides, $[x, z]$ and $[y, z]$ (Figure 15).




Figure 15: A $\delta$-thin geodesic triangle.

Alain Connes has conjectured in [7], pp. 138-141,
that long-term human memory is organized
as a hyperbolic simplicial complex, where a search
is performed in a manner similar to the above.

Appendix A. Proof of the empty space paradox (Theorem 2.1 and Proposition 2.2)

Without loss of generality, normalize the observable
diameter of $\Omega$ to one. Let $\omega \in \Omega$. The distance
function $\rho(\omega, -)$ is 1-Lipschitz and so concentrates
around its median value, $R(\omega)$. The resulting function
$R\colon \Omega \to \mathbb{R}$, $\omega \mapsto R(\omega)$, is also 1-Lipschitz, and
concentrates around its median, $R_M$. It is easy
to check that under our assumptions, the difference
between the mean and the median of every
1-Lipschitz function $f$ on $\Omega$ converges to zero as
$O(1/\sqrt{d})$ (uniformly in $f$). Thus, without loss of
generality, we can assume that, with high confidence,
$R_M \to 1$ as $d \to \infty$. Notice that the above
argument concerns the domain and not a particular
dataset.
To prove Proposition 2.2, fix $\varepsilon > 0$ and sample
an instance of data, $X$. With confidence
$1 - n\exp(-O(d)\varepsilon^2)$, one has $|R(x) - 1| < \varepsilon/2$ for
all $x \in X$. Moreover, since the datapoints are
sampled in an i.i.d. fashion, by the union bound
one has with confidence $1 - n^2\exp(-O(d)\varepsilon^2)$ that
$|\rho(x, y) - R(x)| < \varepsilon/2$ for every pair $x, y \in X$. Since
$n = |X|$ is subexponential in $d$, the statement follows.
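The conclusion of this argument, that all pairwise distances in a sample fall within $\varepsilon$ of one, is easy to observe numerically. The sketch below is a hedged illustration on the Hamming cube with uniformly random datapoints. The function name is hypothetical, and distances are rescaled by the typical distance $d/2$ as a stand-in for the normalization used in the proof, which is an assumption of this example.

```python
import random


def max_relative_deviation(n, d, rng):
    """Largest |dist(x, y) / (d / 2) - 1| over all pairs of n random points of {0,1}^d."""
    points = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    typical = d / 2                      # typical Hamming distance between random points
    worst = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            ham = sum(a != b for a, b in zip(points[i], points[j]))
            worst = max(worst, abs(ham / typical - 1.0))
    return worst
```

With the number of points fixed, the returned deviation shrinks as $d$ grows, so in high dimension every datapoint is at nearly the same distance from every other one.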

To prove Theorem 2.1, again fix $\varepsilon > 0$. Denote by
$\varepsilon_M$ the median value of the function $\varepsilon_{NN}$. Suppose
$\liminf_{d\to\infty} \varepsilon_M < 1$. Proceed to a subsequence of
domains and find $\gamma > 0$ with $\varepsilon_M < R_M - \gamma$ for all
$d$. The probability that $R(\omega)$ deviates from $R_M$ by
more than $\gamma/2$ is exponentially small in $d$. Since
$n = |X|$ only grows sub-exponentially in $d$, with
confidence $1 - \exp(-O(\varepsilon^2 d))$ one has for every
$x \in X$:



R(x) - £ M > 
Now we use a technical observation from [21]: if
$A \subseteq \Omega$ is such that $\mu(A) > \alpha_\Omega(\gamma)$ for some $\gamma > 0$,
then $\mu(A_\gamma) > 1/2$. It follows that

$$\mu(B_{\varepsilon_M}(x)) \le \alpha(\gamma/2) = \exp(-O(\varepsilon^2 d)),$$

and therefore

$$\mu(X_{\varepsilon_M}) \le n\exp(-O(\varepsilon^2 d)) = \exp(-O(\varepsilon^2 d)),$$

which contradicts the definition of $\varepsilon_M$. This implies:
$\liminf_{d\to\infty} \varepsilon_M \ge 1$.

To establish the converse inequality
$\limsup_{d\to\infty} \varepsilon_M \le 1$, recall that a ball of radius
$R(\omega)$ centred at $\omega$ has measure $\ge 1/2$, and so
we have an obvious estimate $\varepsilon_M \le \min_{x \in X} R(x)$.
The rest follows from concentration of the function
$R$ around one.



Appendix B. Distortion of Lipschitz embeddings $\ell^2(d) \hookrightarrow \ell^\infty(d + k)$



Lemma B.1. Fix $k$. Let $c > 0$ be a constant having
the property that for every $d$ and each bounded
subset $X$ of $\ell^2(d)$ there exists a 1-Lipschitz function
$f\colon X \to \ell^\infty(d + k)$ having distortion $c$: for all
$x, y \in X$,

$$\|f(x) - f(y)\|_\infty \le \|x - y\|_2 \le c\,\|f(x) - f(y)\|_\infty. \qquad (B.1)$$

Then $c = \Omega(\sqrt{d + k}/\sqrt{k})$, that is, $\Omega(\sqrt{d})$ with a
constant depending on $k$.



The proof consists of a series of statements.

1. There exists a 1-Lipschitz function $f\colon \ell^2(d) \to \ell^\infty(d + k)$
with the property (B.1).

For every $n \in \mathbb{N}$, choose a function $f_n$ from
the closed $n$-ball $B_n(0)$ in $\ell^2(d)$ to the $n$-ball in
$\ell^\infty(d + k)$ with distortion $c$. The Banach space
ultrapowers of both participating spaces formed with
regard to a non-principal ultrafilter on the integers
(see e.g. page 55 in [26]) are isometric, respectively,
to $\ell^2(d)$ and $\ell^\infty(d + k)$, because the spaces
in question are finite-dimensional. The family of
1-Lipschitz functions $(f_n)$ determines in a standard
way a 1-Lipschitz function, $f$, from $\ell^2(d)$ to
$\ell^\infty(d + k)$, with the property (B.1) being preserved.

2. There exists a linear function $f$ with the property (B.1).

Choose $f$ as in 1. According to the Rademacher
theorem (cf. a discussion and references on p. 42
in [26]), $f$ is differentiable almost everywhere with
regard to the Lebesgue measure. The differential
of $f$ at any point, which we denote $T$, is a linear
operator of norm one having property (B.1). In
particular it is injective (though of course not onto),
and the inverse has norm $\le c$.

Recall that the multiplicative Banach-Mazur distance
between two normed spaces $E$ and $F$ of
the same dimension is the infimum of all numbers
$\|T\| \cdot \|T^{-1}\|$, where $T$ ranges over all isomorphisms
between $E$ and $F$. (See [26], p. 3, and [18], 7.2.)
From the previous observation, we conclude:

3. The Banach-Mazur distance between $\ell^2(d)$
and some $d$-dimensional subspace of $\ell^\infty(d + k)$ is
$\le c$.

4. The Banach-Mazur distance between $\ell^2(d + k)$
and $\ell^\infty(d + k)$ is $O(\sqrt{k}\,c)$.

There is a projection $p$ from $\ell^\infty(d + k)$ having
$T(\ell^2(d))$ as its kernel and such that $\|p\| \le \sqrt{k}$ and
$\|1 - p\| \le \sqrt{k}$ (combine [14], Corollary on page 209,
with a classical result of Kadec and Snobar on projection
constants, cf. [26], p. 71). The Banach-Mazur
distance between $\ell^2(k)$ and any other $k$-dimensional
normed space, including the kernel of
$p$, is $O(\sqrt{k})$. Choose an isomorphism $S$ realizing
this distance, then it is easy to verify that $T \oplus S$
realizes the distance $O(\sqrt{k}\,c)$ between $\ell^2(d + k)$ and
$\ell^\infty(d + k)$.

Finally, the Banach-Mazur distance between
$\ell^2(d + k)$ and $\ell^\infty(d + k)$ is $\sqrt{d + k}$ (cf. [Uj], p. 766).